validation warnings set stacklevel=2 #377
Conversation
**Codecov Report**

```diff
@@           Coverage Diff           @@
##             main     #377      +/-   ##
==========================================
+ Coverage   88.31%   88.65%   +0.34%
==========================================
  Files          19       19
  Lines        2875     2971      +96
==========================================
+ Hits         2539     2634      +95
- Misses        336      337       +1
```
... of course, @craffel already suggested this 3 years ago 😆
Thanks. What does this look like when you call a metric function directly?

Regarding redundant calls to validate, this probably belongs in the other issue, but there must be some Python wizardry that basically says: if the same function is called with the same arguments within this context, skip the call. Basically some kind of argument caching/memoization, but instead of caching the output, just skip the call completely, and evict the cache when you exit the context. That would work, right?
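(For concreteness, here is a minimal sketch of that idea. All names here, `skip_when_seen`, `skip_repeats`, `_seen`, are hypothetical, not part of mir_eval:)

```python
import contextlib
import functools

# None means "not inside the context": calls always go through.
_seen = None


def skip_when_seen(func):
    """Skip calls that repeat an earlier call inside a skip_repeats
    context, rather than caching and returning their output."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        if _seen is not None:
            # numpy arrays are not hashable, so key on object identity:
            # "same arguments" here means the same objects were passed.
            key = (func.__qualname__, tuple(id(a) for a in args),
                   tuple((k, id(v)) for k, v in sorted(kwargs.items())))
            if key in _seen:
                return None  # skip the call entirely
            _seen.add(key)
        return func(*args, **kwargs)
    return wrapper


@contextlib.contextmanager
def skip_repeats():
    global _seen
    _seen = set()
    try:
        yield
    finally:
        _seen = None  # evict the cache when the context exits


@skip_when_seen
def validate(reference, estimated):
    print("validating")


ref, est = [1.0, 2.0], []
with skip_repeats():
    validate(ref, est)  # runs
    validate(ref, est)  # skipped: same objects, same context
validate(ref, est)      # runs again outside the context
```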
Took the liberty of adding the bypass parameter throughout. In all cases, the default behavior is unchanged: everything is validated by default. However, we now avoid doing multiple validation calls on the same data.

I also took the liberty of bumping the chord parser regexp up into a module-level compiled pattern. Recompiling that monster regexp for every chord label was probably the source of the inefficiency I mentioned back in #341.

I'll report back when all safety parameterizations are done.
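(For illustration, the module-level pattern hoisting looks like this; the regex below is a toy stand-in, not mir_eval's actual chord grammar:)

```python
import re

# Compiled once at import time, instead of re-running the compilation
# machinery for every chord label parsed.
_CHORD_RE = re.compile(r"^(?P<root>[A-G][#b]*)(?::(?P<quality>[a-z0-9]+))?$")


def split_chord(label):
    match = _CHORD_RE.match(label)
    if match is None:
        raise ValueError(f"Invalid chord label: {label!r}")
    return match.group("root"), match.group("quality")


print(split_chord("C#:maj7"))  # ('C#', 'maj7')
```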
Hah, sorry, crossed wires here. 😁

The idea is that if the user calls `evaluate`, validation happens once up front and the metric functions are told to skip it. If the user calls a metric directly with the defaults, validation runs as usual. If the user calls either with validation disabled, it is skipped entirely.

Regarding memoization, yeah, we could do that, but I don't think it actually buys us much here. It would add at least one dependency (e.g. joblib) and probably complicate the stacklevel logic more than the solution above.
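(A minimal sketch of that design, using a hypothetical `skip_validation` flag; the actual parameter name in the PR may differ:)

```python
import warnings


def validate(reference_beats, estimated_beats):
    # Toy validator standing in for the real ones.
    if len(estimated_beats) == 0:
        warnings.warn("Estimated beats are empty.", stacklevel=2)


def f_measure(reference_beats, estimated_beats, skip_validation=False):
    # `skip_validation` is an illustrative name for the bypass flag.
    if not skip_validation:
        validate(reference_beats, estimated_beats)
    return 0.0  # placeholder for the actual metric computation


def evaluate(reference_beats, estimated_beats):
    # Validate once up front, then tell each metric to skip it.
    validate(reference_beats, estimated_beats)
    return {"F-measure": f_measure(reference_beats, estimated_beats,
                                   skip_validation=True)}
```

Either way, `validate` ends up being called exactly one frame below the user-facing entry point, so a single stacklevel value serves both paths.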
This works, but doesn't it seem drastic to put a new arg into basically all of the functions just to avoid redundant calls in `evaluate`? It seems possibly worth doing the Python wizardry to avoid adding an arg everywhere.
I wouldn't consider it drastic if it's implemented consistently across the entire package (which is my plan). It's perhaps a little clumsy, but it's easy to understand and use.

A slick pythonic alternative would be to bundle everything into a decorator that abstracts out the validation/bypass logic before getting to the core of the evaluation code. This might be a little fancier on the dev side, but the user experience would be pretty much the same. In this case, as much as I like a slick decorator trick, I'd favor readability and transparency.
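(A sketch of what that decorator alternative could look like; `validated` is a hypothetical name, not an API in mir_eval:)

```python
import functools
import warnings


def validated(validator):
    # Bundle the validation/bypass logic around each metric.
    def decorator(metric):
        @functools.wraps(metric)
        def wrapper(reference, estimated, *, skip_validation=False, **kwargs):
            if not skip_validation:
                validator(reference, estimated)
            return metric(reference, estimated, **kwargs)
        return wrapper
    return decorator


def validate(reference_beats, estimated_beats):
    if len(estimated_beats) == 0:
        # Note: the decorator's wrapper adds a stack frame, which any
        # choice of stacklevel here would have to account for.
        warnings.warn("Estimated beats are empty.", stacklevel=3)


@validated(validate)
def f_measure(reference_beats, estimated_beats):
    return 0.0  # placeholder metric body
```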
... forgot to comment that I'd finished with this retrofit, and I think it's ready for CR.

Thinking more about potential Python wizardry, I think the only reason to prefer that over the explicit parameterization I've implemented here is if we think we might change things later on. That kind of thing happens more for deprecation warnings / parameter renames / etc., which can change frequently over the lifecycle of a project, so there's a benefit to making them easy to add or remove at any point. I don't think that applies in this case.

I do think we need validation bypassing as implemented here if we want all of the following:

- correct warning stacklevels whether a metric is called directly or through `evaluate`
- no redundant validation of the same data inside `evaluate`
- unchanged default behavior
The only remaining question for me right now is whether the stacklevel should be 2 or 3. The call path looks something like: user code → `evaluate` (optionally) → metric function → `validate` → `warnings.warn`.
The main branch currently leaves it at 1, which is not helpful. This PR currently uses 2, which would point to either `evaluate` or the specific metric being called; this seems marginally more helpful. I think 3 might be best, as it points to where the user is entering the mir_eval code path. The only downside here is that if a user is calling `validate()` directly, then the stacklevel might point one level higher up than it should. I expect this case is exceedingly rare and not something we should worry about, but I did want to note it here.
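(For reference, a standalone sketch of how the stacklevel value maps onto that call path, not mir_eval code:)

```python
import warnings


def validate():
    # stacklevel=1 -> this line, inside validate (current main behavior)
    # stacklevel=2 -> validate's caller: the metric, or evaluate
    # stacklevel=3 -> the caller's caller: normally the user's own code
    warnings.warn("Estimated beats are empty.", stacklevel=3)


def f_measure():
    validate()


f_measure()  # with stacklevel=3, the warning is attributed to this line
```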
Sorry for the delay in responding; as I mentioned, my latency is a bit unpredictable these days.

I appreciate the work you put into this, but I don't think adding an additional argument to most user-facing functions in the library is the right solution. It hurts usability/readability significantly, in addition to raising potential backwards-compatibility issues with positional arguments, and therefore should be a last resort. I also don't think it's important to support users disabling validation: the validation logic is there to catch problems that would otherwise cause issues, and to throw more helpful errors/warnings; the only reason it's in a separate function is to avoid code duplication in all of the metric functions. If there was no other option, I think it would make sense to consider adding an arg, but in this case the issue it solves is minor enough (repeated warnings for malformed inputs when evaluate is called) that I'm not even sure it's warranted. In any case, I don't think we need to address this issue or come up with a solution for 0.8.0, so I would be most inclined to punt on it for now.

Regarding the stack level, I'm not sure, since IME the stack level reported by warnings is almost never helpful... but I agree that validate should almost never be called by an end-user, so (3) is probably best.
I don't think this affects positional arguments in existing code, as it's always the last new (positional) parameter. I also don't see how this hurts usability or readability: the default behavior is preserved, and the new behavior is explicitly stated in the documentation. At most it's one extra piece of information in the docs?
I'm okay with punting if you think it's really unimportant.
I actually do use warning stacklevels somewhat often, so if we have a chance to do it right, I think we should. That said, I don't think we can do it right without the change I've implemented here, since there are two alternative routes to the validate function with different stacklevels (user → metric → validate and user → evaluate → metric → validate). This is why I went to the effort of lifting out the validate call and adding bypass options at all. I agree with your earlier point that disabling validation entirely is of negligible utility, but I don't think there's another good option here that still produces a correct stacklevel. (A grungier alternative would be to pass the target stacklevel through to validate, but that's much less generally readable or usable than a bypass option.)
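(A sketch of that grungier alternative, threading the target stacklevel through every layer; the `stacklevel` parameters below are hypothetical, not mir_eval API:)

```python
import warnings


def validate(reference_beats, estimated_beats, stacklevel=2):
    if len(estimated_beats) == 0:
        # +1 accounts for validate's own frame
        warnings.warn("Estimated beats are empty.", stacklevel=stacklevel + 1)


def f_measure(reference_beats, estimated_beats, stacklevel=2):
    # A direct call warns at the user's line; evaluate must bump the
    # level so the warning skips its own frame too.
    validate(reference_beats, estimated_beats, stacklevel=stacklevel)
    return 0.0


def evaluate(reference_beats, estimated_beats):
    return {"F-measure": f_measure(reference_beats, estimated_beats,
                                   stacklevel=3)}
```

Every layer has to agree on the frame bookkeeping, which is what makes this approach grungy.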
Adding an argument that is almost never used to almost all user-facing functions harms readability and usability. I'm not saying it's catastrophic, I'm just saying it should be a last resort: if it can be avoided (which I believe it can), it should be avoided.

Regarding the stack level, thinking about it more, I think stacklevel 2 is reasonable: it flags for the user that the warning is coming from some particular metric function. If that's inside an evaluate call, that's OK, because ultimately all evaluate does is some preprocessing before calling the metric functions. I don't think we can in general assume that some point in the stack corresponds to the point where the user constructed and passed off the data being evaluated, because one level above the evaluate/metric function call could be either user code or some other library the user is using on top of mir_eval.
Ok, I'll close this out then.
Just to clarify, I wasn't saying to nuke this entirely - I still think stacklevel=2 is a QoL improvement over the current behavior and is a simple change that should be merged in.
Yeah, I understand. But as I've said above, with stacklevel=2 the warning gets attributed to a different line for each metric:

```
In [9]: mir_eval.beat.evaluate(ref_data, np.array([]))
/home/bmcfee/git/mir_eval/mir_eval/beat.py:165: UserWarning: Estimated beats are empty.
  validate(reference_beats, estimated_beats)
/home/bmcfee/git/mir_eval/mir_eval/beat.py:207: UserWarning: Estimated beats are empty.
  validate(reference_beats, estimated_beats)
/home/bmcfee/git/mir_eval/mir_eval/beat.py:267: UserWarning: Estimated beats are empty.
  validate(reference_beats, estimated_beats)
/home/bmcfee/git/mir_eval/mir_eval/beat.py:359: UserWarning: Estimated beats are empty.
  validate(reference_beats, estimated_beats)
/home/bmcfee/git/mir_eval/mir_eval/beat.py:454: UserWarning: Estimated beats are empty.
  validate(reference_beats, estimated_beats)
/home/bmcfee/git/mir_eval/mir_eval/beat.py:614: UserWarning: Estimated beats are empty.
  validate(reference_beats, estimated_beats)
```

Since these are coming from different lines in the module (due to stacklevel), they appear to be different warnings, and are thus not absorbed by the default warning filter (which only suppresses repeats from the same location). Compare the current (stacklevel=1) behavior, where the repeated warning collapses into a single report:

```
In [6]: mir_eval.beat.evaluate(ref_data, np.array([]))
/home/bmcfee/git/mir_eval/mir_eval/beat.py:93: UserWarning: Estimated beats are empty.
  warnings.warn("Estimated beats are empty.")
```

Now, Python 3.12 actually added some functionality to do this in exactly the right way with the `skip_file_prefixes` argument to `warnings.warn`.

Which is all to say: I don't think setting stacklevel=2 universally is correct or more helpful than the current behavior, and I also don't see a cleaner way to achieve the most helpful behavior than what I'd implemented in this PR.
Ah, I misunderstood the issue. I'll work on a solution to avoid repeat calls to validate after my leave ends.
Fixes #372

All validation functions now set stacklevel=2. Before, the reported line number comes from `validate`; after, the line number comes from the metrics.

As it currently stands, we do a lot of redundant validation when using the `evaluate()` functions. It might be worth considering an enhancement where `evaluate()` validates the data once at the top, and then each underlying metric provides an option to bypass validation thereafter. I've proposed similar things in the past (#341) as an optimization hack, but in this case it's now also a usability issue, because the relevant stacklevel to warn at differs when metrics are called directly (2) vs. called from `evaluate` (3).

I think it could make sense to roll that into this PR, or not. Curious for other folks' input.
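(The core change described above, as a simplified sketch; the real validators check many more conditions:)

```python
import warnings


def validate(reference_beats, estimated_beats):
    if len(estimated_beats) == 0:
        # Previously: warnings.warn("Estimated beats are empty.")
        # which attributes the warning to this line, inside validate.
        # With stacklevel=2, it is attributed to the metric's call site:
        warnings.warn("Estimated beats are empty.", stacklevel=2)
```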